Amendments to the Claims: 



This listing of claims will replace all prior versions, and listings of claims in the 
application. Applicant has submitted a new complete claim set showing marked up 
claims with insertions indicated by underlining and deletions indicated by strikeouts 
and/or double bracketing. 



Listing of Claims: 

1. (Currently amended) In a database system, a sampling method for
constructing a data structure based on the contents of a database comprising:

selecting [[a) gathering]] an initial sample of data from the database, the initial
sample of data including one or more subparts;

cross-validating a plurality of subparts of the [[and creating a first data structure
from said]] initial data sample, the cross-validating associated with an error
corresponding to a subpart;

sorting substantially simultaneously with the cross-validating the plurality of
subparts to generate a plurality of cross-validation errors;

generating an estimated block size based on the sorting and cross-validating;

selecting an additional [[b) gathering a second]] sample of data, wherein the size of
the selected additional sample of data corresponds to the generated estimated block
size; and

merging the [[c) determining an initial sufficiency of the data gathered from the
database that is based on a comparison of the first data structure and the second
sample of data; and

d) forming a resultant data structure by gathering an]] additional sample of data
with the [[from the database and using the additional amount of data to form the]]

Type of Response: Amendment 
Application Number: 10/814,382 
Attorney Docket Number: 30751 7.01 
Filing Date: March 31, 2004 




[[resultant data structure wherein the amount of data gathered in the additional sample
is based on the]] initial sample of data [[sufficiency determination]].



2. (Currently amended) The method of claim 1 wherein the cross-validating
includes cross-validating subparts of [[resultant data structure is formed based on
data gathered in]] the initial sample of data that are of different sizes[[, the second
sample and the additional sample]].

3. (Currently amended) The method of claim 1 wherein the cross-validating
and sorting are combined in a single step [[first and resultant data structures are
histograms]].

4. (Currently amended) The method of claim [[1]] 3 wherein the single
step includes:

dividing the initial sample of [[and second]] data into multiple subparts;

sorting and cross-validating the multiple subparts recursively [[samples are
randomly retrieved block samples that form a first amount of data that is initially
gathered and then divided in half to provide the initial and second data samples]].

5. (Currently amended) The method of claim 4 wherein the single
step further includes:

building a histogram for at least a first subpart and a second subpart of the
initial sample of [[and second]] data;

testing the histogram of the first subpart against the second subpart to generate
a cross-validation error estimate for a sample size corresponding to the initial sample of
data [[samples are sorted and used to form two histograms]].
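
For illustration only, the recursive single step of claims 4 and 5 can be sketched in Python as a merge sort that records a cross-validation error each time two sorted halves meet. This is a minimal sketch under stated assumptions, not the applicant's implementation: the equi-depth histogram, the mean squared difference of bucket fractions used as the error, and the names `histogram_error`, `sort_and_cross_validate`, `num_buckets` and `min_size` are hypothetical choices that the claims do not specify.

```python
from bisect import bisect_right
from heapq import merge


def histogram_error(sorted_first, second, num_buckets):
    """Build an equi-depth histogram on one subpart and test it on another.

    Assumed error metric (not from the claims): mean squared difference
    between the fraction of each subpart that falls in every bucket.
    """
    n = len(sorted_first)
    # Interior bucket separators taken at equally spaced ranks.
    bounds = [sorted_first[(i * n) // num_buckets] for i in range(1, num_buckets)]

    def fractions(values):
        counts = [0] * num_buckets
        for v in values:
            counts[bisect_right(bounds, v)] += 1
        return [c / len(values) for c in counts]

    first_f, second_f = fractions(sorted_first), fractions(second)
    return sum((a - b) ** 2 for a, b in zip(first_f, second_f)) / num_buckets


def sort_and_cross_validate(values, num_buckets, min_size, errors):
    """Merge sort that also cross-validates the two halves at every level.

    `errors` maps a subpart size to the list of errors observed at that size.
    Returns the sorted values.
    """
    min_size = max(1, min_size)  # guard so every recursive half is non-empty
    if len(values) < 2 * min_size:
        return sorted(values)
    mid = len(values) // 2
    left = sort_and_cross_validate(values[:mid], num_buckets, min_size, errors)
    right = sort_and_cross_validate(values[mid:], num_buckets, min_size, errors)
    # Both halves are already sorted here, so the histogram comes for free.
    errors.setdefault(len(left), []).append(histogram_error(left, right, num_buckets))
    return list(merge(left, right))
```

Called on an initial sample, the `errors` dictionary ends up holding several error estimates per subpart size, because the same data is reused at every level of the recursion.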

6. (Currently amended) The method of claim 5 further comprising
reusing parts of the initial sample of data to generate different cross-validation
[[wherein an]] error estimates, each of the cross-validation error estimates corresponding
to an associated sample size [[metric of the two histograms are formed by cross correlating
the contents of the two histograms to determine the initial sufficiency]].

7. (Currently amended) The method of claim 6 wherein generating
the estimated block size includes:

computing means of the different cross-validation error estimates for each of the
associated [[the initial and second data samples are further sub-divided to form
sub-samples used to form other histograms of differing]] sample sizes;

determining a best fit of the means of the different cross-validation error
estimates;

estimating the block size based on the determined best fit [[that are cross
correlated to find an error metric relating to said differing sample sizes]].

8. (Currently amended) The method of claim [[6]] 7 wherein
determining the best fit includes identifying a best fitting curve associated with the
means of the different cross-validation [[initial and second data samples are further
sub-divided to form additional sub-samples of smaller size that are used to form other
histograms that are cross correlated for use in finding an]] error estimates [[metric relating
to sample sizes for use in determining a size of the additional sample of data to gather
from the database]].

9. (Currently amended) The method of claim [[4]] 8 wherein identifying
a best fitting curve includes:

generating the best fitting curve of the form Δ = c/r, wherein c is a constant,
Δ is an average squared cross-validation error observed for a given sample size, and
r represents the given sample size;

estimating the block size based on the constant c [[additionally comprising
estimating distinct values of an attribute of the initial and second samples by eliminating
records from the blocks that are duplicated within a given block and estimating distinct
values by categorizing attributes as rarely or frequently occurring within the database]].
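
As a hedged illustration of the curve fit in claims 7 to 9, the sketch below averages the cross-validation errors for each sample size and fits a curve in which the error equals a constant c divided by the sample size r, using least squares. The closed form for c, the `target_error` tolerance, and the function names are assumptions introduced here; the claims say only that the block size is estimated from the constant c.

```python
import math


def mean_errors(errors_by_size):
    """Average the cross-validation errors observed at each sample size."""
    return {r: sum(errs) / len(errs) for r, errs in errors_by_size.items()}


def fit_constant(means_by_size):
    """Least-squares fit of error = c / r.

    Minimising sum((e_i - c / r_i)^2) over c gives
    c = sum(e_i / r_i) / sum(1 / r_i^2).
    """
    num = sum(e / r for r, e in means_by_size.items())
    den = sum(1.0 / (r * r) for r in means_by_size)
    return num / den


def estimated_sample_size(c, target_error):
    """Smallest r for which the fitted curve predicts error <= target_error.

    `target_error` is a hypothetical tolerance chosen by the caller.
    """
    return math.ceil(c / target_error)
```

With mean errors of 0.5, 0.25 and 0.125 at sizes 10, 20 and 40, the fit returns a constant close to 5, so a tolerance of 0.25 would call for a sample of about 20 under these assumptions.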

10. (Original) A computer readable medium for performing computer
instructions to implement the method of claim 1.

11. (Currently amended) A database system for constructing histograms
based on sampling the contents of the database comprising:

a) a database management component that gathers block size data segments 
from the database which in aggregate form a first sample of data having a first size; 

b) a histogram construction component that forms a first histogram from the 
first sample of data; and 

c) a correlation component that cross-validates a plurality of subparts of the
initial sample of data and sorts substantially simultaneously with the cross-validating
the plurality of subparts to generate a plurality of cross-validation errors, [[determines an
initial sufficiency of the first sample of data gathered from the database based on a
comparison of the first histogram and data from the first sample of data;]]

[[d)]] wherein said database management component gathers an additional sample
of data used by said histogram construction component in creating a resultant
histogram corresponding to a combination of the additional sample and the initial
sample of data, [[and]] the size of the additional sample being [[is]] based on the
cross-validation errors [[initial sufficiency determination]].

12. (Currently amended) The system of claim 11 wherein the resultant
histogram is formed by the histogram construction component based on data gathered 
in the first sample of data and the additional sample of data. 

13. (Original) The system of claim 11 wherein the first sample of data and
the additional sample of data are randomly retrieved block samples. 

14. (Original) The system of claim 11 wherein histogram construction
component sorts the data in said first sample of data as it constructs the first histogram. 

15. (Currently amended) The system of claim 11 wherein the correlation
component determines the cross-validation errors [[an error metric]] by cross correlating
the contents of the first histogram with other data in said first sample of data to
determine an initial sufficiency of the first sample of data gathered from the
database.

16. (Currently amended) The system of claim 15 wherein the first sample of
data is sub-divided to form [[sub-samples]] the subparts used to form histograms of
differing sizes that are cross correlated to find a cross-validation [[an]] error [[metric]]
relating to said differing sample sizes.

17. (Currently amended) The system of claim 15 wherein the first sample of
data is sub-divided to form additional [[sub-samples]] subparts of smaller size that are
used to form other histograms that are cross correlated for use in finding
cross-validation errors [[an error metric]] relating to sample sizes for use in determining
a size of the additional sample of data to gather from the database.

18. (Currently amended) In a database system, a sampling method for
constructing a histogram based on the contents of a database comprising: 

a) gathering an initial sample of data from the database and creating a histogram 
from said initial sample; 

b) gathering a second sample of data from the database for comparison with said 
first histogram; 

c) determining an initial sufficiency of the data gathered from the database that 
is based on a comparison of the second sample with the first histogram, including
cross-validating and sorting a plurality of portions of the data substantially
simultaneously; and

d) if the determination of initial sufficiency indicates the data in said initial and 
second samples is adequate to represent the database, combining the initial and second 
samples to form a resultant histogram, but if the determination of initial sufficiency 
indicates the initial and second samples are inadequate to represent the database, 
gathering an additional data sample to combine with the initial and second samples to 
form the resultant histogram wherein a size of the additional data sample is based on 
the initial sufficiency determination. 
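
The branch in step d) can be sketched as follows, purely as an illustration. The sketch assumes that the sufficiency check yields a single error number, that a caller-supplied `tolerance` defines what counts as adequate, and that the error shrinks in inverse proportion to the number of blocks sampled. None of these specifics, nor the names `additional_blocks_needed` and `adaptive_sample`, come from the claim.

```python
import math
import random


def additional_blocks_needed(error, blocks_so_far, tolerance):
    """Size the extra sample from the sufficiency check.

    Assumes error ~ c / blocks, so c = error * blocks_so_far and the
    sample must grow to c / tolerance blocks in total.
    """
    if error <= tolerance:
        return 0  # initial and second samples are treated as adequate
    c = error * blocks_so_far
    return math.ceil(c / tolerance) - blocks_so_far


def adaptive_sample(blocks, initial_count, error_fn, tolerance, rng):
    """Gather initial and second samples, then top up only if needed.

    `error_fn(initial, second)` is a caller-supplied comparison of a
    histogram built on `initial` against the data in `second`.
    """
    order = list(range(len(blocks)))
    rng.shuffle(order)  # a random order stands in for random block locations
    used = 2 * initial_count
    initial = [blocks[i] for i in order[:initial_count]]
    second = [blocks[i] for i in order[initial_count:used]]
    error = error_fn(initial, second)
    extra_count = additional_blocks_needed(error, used, tolerance)
    extra = [blocks[i] for i in order[used:used + extra_count]]
    # The resultant histogram would be built from everything returned here.
    return initial + second + extra
```

For example, with four blocks already gathered, an observed error of 0.5 and a tolerance of 0.25, `additional_blocks_needed` asks for four more blocks; with an error at or below the tolerance it asks for none.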

19. (Original) The method of claim 18 wherein the data is gathered in
blocks from random storage locations within the database.

20. (Currently amended) In a database system, a system for constructing a 
data structure based on the contents of a database comprising: 

a) means for gathering an initial sample of data from the database and creating a 
first data structure from said initial sample; 

b) means for determining an initial sufficiency of the data gathered from the 
database that is based on a comparison of the first data structure and other data in the 
initial sample not used to create the first data structure, the comparison being based on
cross-validating and substantially simultaneously sorting a plurality of portions of the
data; and

c) means for forming a resultant data structure by gathering an additional 
sample of data from the database and using the additional amount of data to form the 
resultant data structure wherein the amount of data gathered in the additional sample is 
based on the initial sufficiency determination. 

21. (Original) The system of claim 20 wherein the resultant data structure
is formed based on data gathered in the initial sample and the additional sample. 

22. (Original) The system of claim 21 wherein the first and resultant data 
structures are histograms. 

23. (Original) The system of claim 20 wherein the initial data sample is 
made up of randomly retrieved block samples that form a first amount of data that is 
divided in half to provide data to form the data structure and data to cross correlate
against the first data structure. 

24. (Original) The system of claim 23 wherein the initial data samples is 
sorted and used to form two histograms. 

25. (Original) The system of claim 24 wherein an error metric of the two 
histograms are formed by cross correlating the contents of the two histograms to 
determine the initial sufficiency. 

26. (Original) The system of claim 25 wherein the initial data sample is
further sub-divided to form sub-samples used to form other histograms of differing
sample sizes that are cross correlated to find an error metric relating to said differing 
sample sizes. 

27. (Original) The system of claim 26 wherein the initial and second data 
samples are further sub-divided to form additional sub-samples of smaller size that are 
used to form other histograms that are cross correlated for use in finding an error 
metric relating to sample sizes for use in determining a size of the additional sample of 
data to gather from the database. 

28. (Original) The system of claim 24 additionally comprising means for 
estimating distinct values of an attribute of the initial and second samples by eliminating 
records from the blocks that are duplicated within a given block and estimating distinct 
values by categorizing attributes as rarely or frequently occurring within the database. 
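
A rough sketch of the distinct-value estimate described in this claim follows. It collapses duplicates within each sampled block, then treats a value as rarely occurring if it shows up in exactly one sampled block and as frequently occurring otherwise. The scale-up applied to the rare values, the square root of the ratio of total blocks to sampled blocks, is borrowed from the general form of the GEE distinct-value estimator as an assumption; the claim does not name an estimator, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def estimate_distinct(sampled_blocks, total_blocks):
    """Estimate the number of distinct values from a block-level sample."""
    block_hits = Counter()
    for block in sampled_blocks:
        # Eliminate records duplicated within a given block.
        for value in set(block):
            block_hits[value] += 1
    # Assumed categorisation: seen in one block = rare, otherwise frequent.
    rare = sum(1 for hits in block_hits.values() if hits == 1)
    frequent = len(block_hits) - rare
    # Assumed scale-up: each rare value stands in for unseen rare values.
    scale = math.sqrt(total_blocks / len(sampled_blocks))
    return frequent + scale * rare
```

When every block has been sampled the scale factor is 1 and the function simply returns the number of distinct values seen.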

29-31. (Canceled)



32. (Original) A computer readable medium for performing computer 
instructions to implement the method of claim 20. 

33-34. (Canceled) 